Goto

Collaborating Authors

 Gardner


PCMind-2.1-Kaiyuan-2B Technical Report

Luo, Kairong, Sun, Zhenbo, Shi, Xinyu, Chen, Shengqi, Yu, Bowen, Chen, Yunyi, Dang, Chenyi, Tao, Hengtao, Wang, Hui, Liu, Fangming, Lyu, Kaifeng, Chen, Wenguang

arXiv.org Artificial Intelligence

The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.


Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Ali, Mehdi, Brack, Manuel, Lübbering, Max, Wendt, Elias, Khan, Abbas Goher, Rutmann, Richard, Jude, Alex, Kraus, Maurice, Weber, Alexander Arno, Kaczér, David, Mai, Florian, Flek, Lucie, Sifa, Rafet, Flores-Herr, Nicolas, Köhler, Joachim, Schramowski, Patrick, Fromm, Michael, Kersting, Kristian

arXiv.org Artificial Intelligence

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.


Craw4LLM: Efficient Web Crawling for LLM Pretraining

Yu, Shi, Liu, Zhiyuan, Xiong, Chenyan

arXiv.org Artificial Intelligence

Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.


Full-Head Segmentation of MRI with Abnormal Brain Anatomy: Model and Data Release

Birnbaum, Andrew M, Buchwald, Adam, Turkeltaub, Peter, Jacks, Adam, Huang, Yu, Datta, Abhisheck, Parra, Lucas C, Hirsch, Lukas A

arXiv.org Artificial Intelligence

The goal of this work was to develop a deep network for whole-head segmentation, including clinical MRIs with abnormal anatomy, and compile the first public benchmark dataset for this purpose. We collected 91 MRIs with volumetric segmentation labels for a diverse set of human subjects (4 normal, 32 traumatic brain injuries, and 57 strokes). These clinical cases are characterized by extended cerebrospinal fluid (CSF) in regions normally containing the brain. Training labels were generated by manually correcting initial automated segmentations for skin/scalp, skull, CSF, gray matter, white matter, air cavity, and extracephalic air. We developed a MultiAxial network consisting of three 2D U-Net models that operate independently in sagittal, axial, and coronal planes and are then combined to produce a single 3D segmentation. The MultiAxial network achieved test-set Dice scores of 0.88 (median plus-minus 0.04). For brain tissue, it significantly outperforms existing brain segmentation methods (MultiAxial: 0.898 plus-minus 0.041, SynthSeg: 0.758 plus-minus 0.054, BrainChop: 0.757 plus-minus 0.125). The MultiAxial network gains in robustness by avoiding the need for coregistration with an atlas. It performed well in regions with abnormal anatomy and on images that have been de-identified. It enables more robust current flow modeling when incorporated into ROAST, a widely-used modeling toolbox for transcranial electric stimulation. We are releasing a state-of-the-art model for whole-head MRI segmentation, along with a dataset of 61 clinical MRIs and training labels, including non-brain structures. Together, the model and data may serve as a benchmark for future efforts.


WavePulse: Real-time Content Analytics of Radio Livestreams

Mittal, Govind, Gupta, Sarthak, Wagle, Shruti, Chopra, Chirag, DeMattee, Anthony J, Memon, Nasir, Ahamad, Mustaque, Hegde, Chinmay

arXiv.org Artificial Intelligence

Radio remains a pervasive medium for mass information dissemination, with AM/FM stations reaching more Americans than either smartphone-based social networking or live television. Increasingly, radio broadcasts are also streamed online and accessed over the Internet. We present WavePulse, a framework that records, documents, and analyzes radio content in real-time. While our framework is generally applicable, we showcase the efficacy of WavePulse in a collaborative project with a team of political scientists focusing on the 2024 Presidential Elections. We use WavePulse to monitor livestreams of 396 news radio stations over a period of three months, processing close to 500,000 hours of audio streams. These streams were converted into time-stamped, diarized transcripts and analyzed to track answer key political science questions at both the national and state levels. Our analysis revealed how local issues interacted with national trends, providing insights into information flow. Our results demonstrate WavePulse's efficacy in capturing and analyzing content from radio livestreams sourced from the Web. Code and dataset can be accessed at \url{https://wave-pulse.io}.


A Reinforcement Learning Approach to Domain-Knowledge Inclusion Using Grammar Guided Symbolic Regression

Crochepierre, Laure, Boudjeloud-Assala, Lydia, Barbesant, Vincent

arXiv.org Artificial Intelligence

In recent years, symbolic regression has been of wide interest to provide an interpretable symbolic representation of potentially large data relationships. Initially circled to genetic algorithms, symbolic regression methods now include a variety of Deep Learning based alternatives. However, these methods still do not generalize well to real-world data, mainly because they hardly include domain knowledge nor consider physical relationships between variables such as known equations and units. Regarding these issues, we propose a Reinforcement-Based Grammar-Guided Symbolic Regression (RBG2-SR) method that constrains the representational space with domain-knowledge using context-free grammar as reinforcement action space. We detail a Partially-Observable Markov Decision Process (POMDP) modeling of the problem and benchmark our approach against state-of-the-art methods. We also analyze the POMDP state definition and propose a physical equation search use case on which we compare our approach to grammar-based and non-grammarbased symbolic regression methods. The experiment results show that our method is competitive against other state-of-the-art methods on the benchmarks and offers the best error-complexity trade-off, highlighting the interest of using a grammar-based method in a real-world scenario.